1 Overview

1.1 Primary Sjögren’s syndrome(pSS)

  • Primary Sjögren’s syndrome (pSS) is a chronic autoimmune disease that affects exocrine glands, such as the tear and saliva glands. The underlying cause of pSS is not fully understood, but it is thought to involve a combination of genetic and environmental factors that trigger an abnormal immune response (Luo et al. 2020). To understand the genetic cause of pSS, we focus on the transcriptomie analysis of exocrine glands from pSS and non-pSS patients.

1.2 Patients and Data

  • As discussed, we selected GSE159574 from the NCBI GEO database, which consists of RNA sequencing with salivary glands with 16 pSS patients and 13 non-pSS patients. Two subgroups both experienced the subjective clinical symptoms of xerostomia or xerophthalmia while pSS patients fulfill the clinical classification criteria of the pSS. The other group serves as the control for the experiment. The original publication for the dataset is here (Luo et al. 2020).

1.3 Data Processings

  • In Assignment 1, we performed data cleaning and normalization on a gene expression dataset. This process involved removing low-quality samples, filtering out genes with low expression levels, and normalizing the remaining data to account for technical variability. After completing these steps, the final coverage of the dataset was 23 samples and 20394 genes. This means that we were left with high-quality data for 23 samples, each of which had expression measurements for 20394 genes.

1.4 Differential Gene Expression Analysis

  • In assignment 2, we conducted a differential gene expression analysis to compare the gene expression profiles of primary Sjögren’s Syndrome (pSS) patients to those of the non-pSS group. The results of this analysis provided us with a ranked gene list covering all the genes using logFC, log fold change . This ranked list has been stored in the file ./A3/pSS.rnk for further investigation in this assignment.

2 Setups

2.1 Dependencies

library(ggplot2) #ploting
library(stringr) #string processing

2.2 Helper Functions

  • We prepare the function for plotting the enrichment results.
#' Plot out dot plot for enrichment analysis results
#'
#' @param data enrichment result
#'
#' @return dot plot
dotplot <- function(data){
  plot <- ggplot(data) + 
                  geom_point(aes(
                              x = es, 
                              color = FDR,
                              y = term_name,
                              size = size)) +
                  theme(axis.title.x = element_text(size = 8),
                        axis.title.y = element_text(size = 8)) +
                 scale_color_gradient(low = "red", high = "blue") +
                  labs(x = "Enrichment Score(ES)",
                       color = "P-value", 
                       size = "Geneset Size", 
                       y = NULL)
  return(plot)
}

3 Non-thresholded Gene Set Enrichment Analysis

3.1 Method

  • In our next step, we will employ Gene Set Enrichment Analysis (GSEA) (Subramanian et al. 2005; Mootha et al. 2003) using the Bader Lab gene set (BaderLab 2016b) for humans to further investigate the biological pathways and processes that may be impacted in pSS group vs. non-pSS group. This analysis will be conducted using GSEA mac software version 4.3.2, a well-established and widely-used computational method to identify and interpret the functional context of differentially expressed genes. The Bader Lab gene set, known for its comprehensive and up-to-date collection of gene sets, will be utilized in its most recent release dated March 2, 2023, ensuring that our analysis is based on the latest available knowledge.

3.2 Enrichment Result

  • We load the pre-computed results (more information available in Journal):
pos_path <- "./A3/gsea_report_for_pos_pSS.tsv"
neg_path <- "./A3/gsea_report_for_neg_pSS.tsv"

pos_enrich <- read.table(file = pos_path, sep = '\t', header = TRUE, fill = TRUE)
neg_enrich <- read.table(file = neg_path, sep = '\t', header = TRUE, fill = TRUE)
  • We visualize the enrichment results in dot plot (we select the most statistically significant ones on p-values):

Enrichment in pSS group:

# perpare the input
pos_package <- data.frame(
  term_name = str_wrap(sub("\\%.*", "", pos_enrich$NAME), 20), # truncate %
  size = pos_enrich$SIZE,
  FDR = pos_enrich$FDR.q.val,
  es = pos_enrich$ES
)

# plot
dotplot(pos_package[1:10,])

Figure 1. The Dot plot displays the GSEA results for 10 significant hits in enriched genes for pSS group.

Enrichment in non-pSS group:

# prepare the input
neg_package <- data.frame(
  term_name = str_wrap(sub("\\%.*", "", neg_enrich$NAME), 20), # truncate %
  size = neg_enrich$SIZE,
  FDR = neg_enrich$FDR.q.val,
  es = neg_enrich$ES
)

# plot
dotplot(neg_package[1:10,])

Figure 2. The Dot plot displays the GSEA results for 10 significant hits in enriched genes for non-pSS group.

  • Figure 1 and Figure 2 provide a clear visual representation of our GSEA results through the use of informative dot plots. It is important to emphasize that all the results presented are statistically significant. The main gene sets enriched in the pSS group are primarily related to immune response, interferon response, and allograft rejection, which is consistent with the established understanding of pSS as an autoimmune disease (Luo et al. 2020). In contrast, the non-pSS group displays enrichment in gene sets involved in skin development and protein synthesis, inferring that the protein synthesis is suppressed in pSS groups.

3.3 Comparison with ORA

  • We pull the results from ORA analysis from A2:

    Up-regulated Genes Down-regulated Genes
    response to other organism tear secretion
    response to external biotic stimulus response to lipopolysaccharide
    response to biotic stimulus regulation of epithelial cell proliferation
    immune system process organ or tissue specific immune response
    immune response neutrophil aggregation
    antigen processing and presentation of peptide antigen via MHC class I mucosal immune response.
    antigen processing and presentation of peptide antigen intermediate filament organization
    antigen processing and presentation of endogenous peptide antigen intermediate filament cvtoskeleton organization
    antigen processing and presentation of endogenous antigen innate immune response in mucosa
    antigen processing and presentation epithelial cell proliferation

Table 1. ORA results from A2 for up-regulated genes and down-regulated genes for pSS condition.

  • By comparing the GSEA results (Figure 1 and Figure 2) and the ORA result (Table 1), the enriched gene sets identified by two distinct analytical methods, demonstrate a strong alignment and consistency in their findings. Both methods reveal significant enrichment in immune response-related gene sets, further substantiating the well-established autoimmune nature of pSS(Luo et al. 2020). On the other hand, for the non-pSS group, the ORA indicates enrichment in cell proliferation, which can be linked to the skin development and protein synthesis gene sets identified in the GSEA results.

  • However, the comparison is not straightforward. This is primarily due to the fact that the gene sets used in each method are not identical. In the GSEA analysis, we utilized the BaderLab curated gene sets (BaderLab 2016b), which are curated from multiple sources, whereas in the ORA analysis, we only covered Gene Ontology Biological Process (GO:BP) gene sets (Mi et al. 2019). As a result, the conclusions we have drawn are based on a qualitative comparison of gene set names rather than a rigorous quantitative analysis.

4 Visualization in Cytoscape

4.1 Graph construction

Figure 3. Enrichment map showing pSS vs. non-pSS with threshold FDR at 0.01.

  • We use Cytoscape(Shannon et al. 2003) and its EnrichmentMap(Merico et al. 2010) app to construct the enrichment map for further analysis. We determined the node threshold based on empirical experience. At a false discovery rate (FDR) of 0.05, the graph includes more than 1,000 gene sets, which poses challenges for downstream analysis. We then further reduced the threshold to an FDR of 0.01, effectively limiting the number of gene sets to 745 and edge counts to 5,914, as illustrated in Figure 3. The edge is constructed using the default algorithm which the edge value is weighted eqaully by Jaccard and overlaps. Edge values lower than 0.375 are filtered out.

4.2 Network Annotation

Figure 4. Annotated enrichment map showing the clustered genesets.

  • We then use AutoAnnotate app(BaderLab 2016a) to construct the annotation after manual adjustment discussed in Journal. MCL clustering algorithm in the app (BaderLab 2016a) is used to show the common labels among nodes. For the labels, we have max word per label leaving default at 3 and minimum word occurrence at 1.

  • As shown in Figure 4, the analysis of enriched gene sets for pSS samples has identified the regulation of immune response as the most common theme. Additionally, some cell differentiation and proliferation keywords have emerged in our findings, highlighting other potential molecular mechanisms at play in pSS. In contrast, for non-pSS groups, our results show enrichment in gene sets related to protein synthesis and cell structure.

4.3 Collapsed network of major biological themes

Figure 5. Collapsed theme network reflecting the enriched gene sets for pSS groups.

  • As shown in Figure 5, we used AutoAnnotate app(BaderLab 2016a) to construct a summary network for the annotated graph. The major themes are consistent with the observation before, including regulation of immune response, cell differentiation and proliferation for pSS group. Some protein synthesis and cell structure themes are related to non-pSS groups.

  • The results fit our proposed model and major findings in the original publication(Luo et al. 2020), e.g. identification of Immuno response. Interestingly, we identified more themes related with skin, cell structure and differentiation than our expectations. Further investigation is needed to determine the relationship between these molecular landscapes to pSS.

5 Post analysis

5.1 Transcription factor signiture study

  • Increasing evidence suggests that pSS may be associated with inheritable factors, and multiple publications have identified potential risk factors linked to mutations in transcription factors (Imgenberg-Kreuz et al. 2021). To further explore this relationship, we performed a post-analysis (signature gene set analysis) with the aim of identifying enriched transcription factors associated with our findings.

  • For this purpose, we used the signature gene set of Transcription Factors from Bader Lab(BaderLab 2016b), utilizing their latest released version from March 2, 2023. To compute the graph, we employed the EnrichmentMap app(BaderLab 2016a) and applied a two-sided Mann-Whitney test to set the threshold at a p-value of 0.05. This allowed us to identify and showcase three selected signature sets as demonstrative examples of the transcription factors potentially implicated in pSS. These transcription factor signature sets are shown in table Table 2:

    Signiture TFs
    IRF5
    STAT4
    ATF4

Table 2. Selected transcription factor genesets for post-analysis.

5.2 Post-analysis results

Figure 6. Selected Transcription Factor signature genesets. (A) STAT4 interacts with multiple immune response genesets. (B) IRF5 is related to protein biosynthesis and cell development. (C) ATF4 enriches to infection signalling pathway.

  • As shown in Figure 6, we closely examined three transcription factor gene sets: STAT4, IRF5, and ATF4. STAT4 is associated with multiple immune response gene sets, while IRF5 is related to protein synthesis and cell development. Notably, both of these transcription factors have been reported as risk factors for pSS (Imgenberg-Kreuz et al. 2021). IRF5, in particular, is a transcription factor that operates downstream of toll-like receptors (TLRs) and type I IFN receptors, promoting the expression of numerous antiviral and pro-inflammatory proteins(Takaoka et al. 2005). Meanwhile, STAT4 is involved in activating the downstream immunomodulatory cytokine secretion(Nordmark et al. 2009). Our findings align with the known roles of these transcription factors in pSS.

  • However, to the best of our knowledge, ATF4 has not been reported in any articles related to pSS. ATF4 is known to mediate programmed cell death by promoting the expression of genes involved in cellular amino acid metabolic processes, mRNA translation, and the terminal unfolded protein response (Ohoka et al. 2005). This suggests that ATF4 plays a role in stress response. Intriguingly, the gene is primarily located on the X chromosome (Ohoka et al. 2005), and given pSS’s high occurrence in females (16 females : 1 male) (Imgenberg-Kreuz et al. 2021), this may imply a potential role for ATF4 in pSS pathogenesis. Nevertheless, further investigation is needed to confirm these preliminary observations and to fully understand the role of ATF4 in the context

6 Interpretation

Our enrichment results lend support to the conclusions and mechanisms discussed in the original paper (Luo et al. 2020). The original paper described pSS as an autoimmune disease, and our findings further emphasize the significant enrichment of immune response in the pSS group. Several other pathogenesis features align with the original paper as well, such as the enrichment of the lymphocytic infiltration gene set and the identification of immune-response-related protein synthesis. When comparing our results with the ORA from the previous assignment, we find that the majority of enrichments are consistent between the two methods, with the exception of pathways in non-pSS groups. The ORA suggests that some pathways enriched in non-pSS groups are related to tear secretion, which is not fully captured in our non-thresholded method. One possible explanation for this discrepancy is that we utilized a more comprehensive and curated gene set from Bader Lab (BaderLab 2016b) in our analysis, allowing us to explore the pathogenesis at a higher resolution.

We would like to highlight a novel finding from our post-analysis (here), which identified a novel transcription factor, ATF4, as being associated with enriched gene sets in pSS, even though it has not been reported in previous pSS studies to the best of our knowledge. Transcription factor ATF4 is known to mediate programmed cell death by promoting the expression of genes involved in cellular amino acid metabolic processes, mRNA translation, and the terminal unfolded protein response (Ohoka et al. 2005). This function aligns with the immune response mechanism of pSS. Additionally, ATF4 has multiple gene replications located on the X chromosome. The X chromosome has been proposed as a promising candidate for involvement in pSS pathogenesis, as the study by Imgenberg et al. (Imgenberg-Kreuz et al. 2021) reported a higher risk for females, with a male-to-female ratio of 1:16. This skewed prevalence suggests a potential connection between the X chromosome and the development of pSS. Taking these factors into account, it may suggest that ATF4, with its roles in cellular processes and location on the X chromosome, may have a potential pathogenic role in pSS. Further investigation is necessary to confirm this hypothesis and elucidate the precise mechanisms through which ATF4 dysfunction might contribute to the development of pSS.

References

BaderLab. 2016a. “Autoannotate.” https://apps.cytoscape.org/apps/autoannotate.
———. 2016b. “GeneSets.” http://baderlab.org/GeneSets.
Imgenberg-Kreuz, Juliana, Astrid Rasmussen, Kathy Sivils, and Gunnel Nordmark. 2021. “Genetics and Epigenetics in Primary Sjögren’s Syndrome.” Rheumatology 60 (5): 2085–98.
Luo, Jing, Xin Liao, Lihe Zhang, Xin Xu, Senhong Ying, Mengjiao Yu, Lixia Zhu, Suxian Lin, and Xiaobing Wang. 2020. “Transcriptome Sequencing Reveals Potential Roles of ICOS in Primary Sjögren’s Syndrome.” Frontiers in Cell and Developmental Biology 8: 592490.
Merico, D., R. Isserlin, O. Stueker, A. Emili, and G. D. Bader. 2010. “Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation.” Genome Biology 11: R94.
Mi, Huaiyu, Anushya Muruganujan, Dustin Ebert, Xiaosong Huang, and Paul D Thomas. 2019. “PANTHER Version 14: More Genomes, a New PANTHER GO-Slim and Improvements in Enrichment Analysis Tools.” Nucleic Acids Research 47 (D1): D419–26.
Mootha, Vamsi K, Cecilia M Lindgren, Karl-Fredrik Eriksson, Aravind Subramanian, Smita Sihag, Joseph Lehar, Pere Puigserver, et al. 2003. “PGC-1\(\alpha\)-Responsive Genes Involved in Oxidative Phosphorylation Are Coordinately Downregulated in Human Diabetes.” Nature Genetics 34 (3): 267–73.
Nordmark, Gunnel, Gudlaug Kristjansdottir, Elke Theander, Per Eriksson, JG Brun, Chuan Wang, L Padyukov, et al. 2009. “Additive Effects of the Major Risk Alleles of IRF5 and STAT4 in Primary Sjögren’s Syndrome.” Genes & Immunity 10 (1): 68–76.
Ohoka, Nobumichi, Satoshi Yoshii, Takayuki Hattori, Kikuo Onozaki, and Hidetoshi Hayashi. 2005. “TRB3, a Novel ER Stress-Inducible Gene, Is Induced via ATF4–CHOP Pathway and Is Involved in Cell Death.” The EMBO Journal 24 (6): 1243–55.
Shannon, P., A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. 2003. “Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks.” Genome Research 13: 2498–2504.
Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50.
Takaoka, Akinori, Hideyuki Yanai, Seiji Kondo, Gordon Duncan, Hideo Negishi, Tatsuaki Mizutani, Shin-ichi Kano, et al. 2005. “Integral Role of IRF-5 in the Gene Induction Programme Activated by Toll-Like Receptors.” Nature 434 (7030): 243–49.